Skip to content

cp: fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283) into r0.4.0#3305

Draft
yaoyu-33 wants to merge 1 commit intor0.4.0from
cherrypick/3283-to-r0.4.0
Draft

cp: fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283) into r0.4.0#3305
yaoyu-33 wants to merge 1 commit intor0.4.0from
cherrypick/3283-to-r0.4.0

Conversation

@yaoyu-33
Copy link
Copy Markdown
Contributor

@yaoyu-33 yaoyu-33 commented Apr 13, 2026

Summary

Test plan

  • CI passes on r0.4.0 base
  • Verify NCCL env vars are set when recipe.ddp.nccl_ub=True

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Performance
    • Improved distributed training efficiency by enabling optimized NVIDIA collective communication settings based on runtime configuration, delivering enhanced performance when supported hardware and settings are available.

…3283)

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yaoyu-33 yaoyu-33 requested review from a team, erhoo82 and malay-nagda as code owners April 13, 2026 21:13
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

The script now conditionally sets NCCL environment variables (NCCL_NVLS_ENABLE and NCCL_CTA_POLICY) in main() when recipe.ddp.nccl_ub is enabled before invoking the training process.

Changes

Cohort / File(s) Summary
NCCL Configuration Setup
scripts/performance/run_script.py
Added conditional initialization of NCCL environment variables based on recipe.ddp.nccl_ub configuration flag.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly indicates the purpose of the PR: setting NCCL environment variables when nccl_ub is enabled via recipe config, and specifies it's a cherry-pick of PR #3283 into the r0.4.0 branch. This directly aligns with the changeset which adds conditional NCCL_NVLS_ENABLE and NCCL_CTA_POLICY environment variable setup in main().
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed Minor 5-line change with documented test plan including CI verification and NCCL environment variable validation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cherrypick/3283-to-r0.4.0

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/performance/run_script.py`:
- Around line 99-101: The code uses os.environ.setdefault(...) for NCCL flags
which won't override existing values; change this to explicit assignment when
recipe.ddp.nccl_ub is truthy: set os.environ["NCCL_NVLS_ENABLE"] = "1" and
os.environ["NCCL_CTA_POLICY"] = "1" instead of setdefault so the flags are
enforced; update the block referencing recipe.ddp.nccl_ub and replace calls to
os.environ.setdefault with direct assignments to the two environment variables.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b906de96-d88a-41c2-a6c5-5b59db86873a

📥 Commits

Reviewing files that changed from the base of the PR and between f6e8b66 and 2c109e9.

📒 Files selected for processing (1)
  • scripts/performance/run_script.py

Comment on lines +99 to +101
if getattr(recipe.ddp, "nccl_ub", False):
os.environ.setdefault("NCCL_NVLS_ENABLE", "1")
os.environ.setdefault("NCCL_CTA_POLICY", "1")
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Check new nccl_ub env logic in run_script.py:"
rg -n -C3 'nccl_ub|NCCL_NVLS_ENABLE|NCCL_CTA_POLICY|setdefault' scripts/performance/run_script.py

echo
echo "Check existing defaults in performance pipeline:"
rg -n -C3 '"NCCL_NVLS_ENABLE"\s*:\s*"0"|NCCL_CTA_POLICY' \
  scripts/performance/utils/executors.py \
  scripts/performance/setup_experiment.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2400


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check imports in run_script.py ==="
head -50 scripts/performance/run_script.py | grep -E "import|from"

echo
echo "=== Check function/module that calls run_script.py or sets up environment ==="
rg -l "run_script" scripts/performance/

echo
echo "=== Check other setdefault usage in scripts/performance/ ==="
rg -n "setdefault" scripts/performance/

echo
echo "=== Check direct assignment patterns for env vars in scripts/performance/ ==="
rg -n 'os\.environ\[.*\]\s*=' scripts/performance/ | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1117


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check if setup_experiment.py imports executors.py ==="
rg -n "import.*executor|from.*executor" scripts/performance/setup_experiment.py

echo
echo "=== Check how NCCL_NVLS_ENABLE is used in setup_experiment.py (context around line 340) ==="
sed -n '330,360p' scripts/performance/setup_experiment.py

echo
echo "=== Check executors.py to see how defaults are used ==="
sed -n '30,70p' scripts/performance/utils/executors.py

echo
echo "=== Check if run_script.py is executed as subprocess (checking for subprocess calls in setup_experiment) ==="
rg -n "subprocess|Popen|run\(" scripts/performance/setup_experiment.py | head -10

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3174


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check how executors handle env vars - look at function signatures and merging logic ==="
sed -n '70,150p' scripts/performance/utils/executors.py

echo
echo "=== Check kubeflow_executor to see how custom_env_vars is used ==="
rg -n "def kubeflow_executor" -A 30 scripts/performance/utils/executors.py

echo
echo "=== Check how env vars are passed to subprocess/container in executors ==="
rg -n "PERF_ENV_VARS|custom_env_vars" scripts/performance/utils/executors.py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5308


🏁 Script executed:

#!/bin/bash
set -euo pivefail

echo "=== Check where run_script.py is actually invoked - look for command construction ==="
rg -n "run_script.py" scripts/performance/setup_experiment.py

echo
echo "=== Look at how the executor actually runs the command with env vars ==="
sed -n '200,250p' scripts/performance/utils/executors.py

echo
echo "=== Verify if env_vars dict is actually merged/passed to subprocess ==="
rg -n "env=" scripts/performance/utils/executors.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 124


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check where run_script.py is actually invoked - look for command construction ==="
rg -n "run_script.py" scripts/performance/setup_experiment.py

echo
echo "=== Look at how the executor actually runs the command with env vars ==="
sed -n '200,250p' scripts/performance/utils/executors.py

echo
echo "=== Verify if env_vars dict is actually merged/passed to subprocess ==="
rg -n "env=" scripts/performance/utils/executors.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2108


Use explicit assignment instead of setdefault() for NCCL environment variable enforcement

On lines 100-101, os.environ.setdefault() will skip overriding any pre-existing values. When recipe.ddp.nccl_ub is true, the intent is to explicitly enable these settings. Using setdefault() allows pre-existing "0" values to persist, which contradicts the intent and violates the code style guideline to "be as explicit as possible."

Proposed fix
    if getattr(recipe.ddp, "nccl_ub", False):
-        os.environ.setdefault("NCCL_NVLS_ENABLE", "1")
-        os.environ.setdefault("NCCL_CTA_POLICY", "1")
+        os.environ["NCCL_NVLS_ENABLE"] = "1"
+        os.environ["NCCL_CTA_POLICY"] = "1"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if getattr(recipe.ddp, "nccl_ub", False):
os.environ.setdefault("NCCL_NVLS_ENABLE", "1")
os.environ.setdefault("NCCL_CTA_POLICY", "1")
if getattr(recipe.ddp, "nccl_ub", False):
os.environ["NCCL_NVLS_ENABLE"] = "1"
os.environ["NCCL_CTA_POLICY"] = "1"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/run_script.py` around lines 99 - 101, The code uses
os.environ.setdefault(...) for NCCL flags which won't override existing values;
change this to explicit assignment when recipe.ddp.nccl_ub is truthy: set
os.environ["NCCL_NVLS_ENABLE"] = "1" and os.environ["NCCL_CTA_POLICY"] = "1"
instead of setdefault so the flags are enforced; update the block referencing
recipe.ddp.nccl_ub and replace calls to os.environ.setdefault with direct
assignments to the two environment variables.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dingqingy-nv the AI comment for setdefault seems valid.
the value will not be updated to 1 if the env var already exists (regardless if its 1 or 0) in env inside container.

Why not use direct assignment?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@ko3n1g
Copy link
Copy Markdown
Contributor

ko3n1g commented Apr 13, 2026

for patch

@ko3n1g ko3n1g marked this pull request as draft April 13, 2026 21:28
@dingqingy-nv
Copy link
Copy Markdown
Contributor

For context, this PR is to cherry-pick #3283, which is an interface fix for nccl_ub.

At this point, none of our benchmark use this feature so this fix is not required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants